generated in 2017-05-08 14:29:18
Project_id: CA-PM-20170506_p1p4
sync time
Start: Mon May 8 09:20:01 CST 2017
Finish: Mon May 8 11:03:01 CST 2017
Time: 1.71667h
data_size: 98.76G ; speed: 15.98M/s
split time
Start: Mon May 8 11:02:09 CST 2017
Finish: Mon May 8 13:19:02 CST 2017
Time: 2.28139h
split data
unmatched 48119.9 Mb
matched 105634 Mb
total 153754 Mb
low_data samples 0
sync time
Start: Thu May 4 05:40:01 CST 2017
Finish: Thu May 4 06:54:53 CST 2017
Time: 1.24778h
data_size: 66.79G ; speed: 14.87M/s
split time
Start: Thu May 4 06:56:36 CST 2017
Finish: Thu May 4 10:32:29 CST 2017
Time: 3.59806h
split data
unmatched 36732 Mb
matched 93732 Mb
total 130464 Mb
low_data samples 0
sync time
Start: Sat May 6 02:20:01 CST 2017
Finish: Sat May 6 03:19:30 CST 2017
Time: 0.991389h
data_size: 50.86G ; speed: 14.25M/s
split time
Start: Sat May 6 03:22:48 CST 2017
Finish: Sat May 6 05:27:50 CST 2017
Time: 2.08389h
split data
unmatched 27080.1 Mb
matched 73056.4 Mb
total 100137 Mb
low_data samples 0
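The Time and speed fields of the sync blocks above can be re-derived from the Start and Finish stamps. The following is a minimal R sketch for the first block only. It assumes (the log does not say so) that 1 G is counted as 1000 M; the variable names are illustrative and not part of the pipeline.

```r
# Timestamps from the first sync block; the time zone is irrelevant for a difference
start  <- as.POSIXct("2017-05-08 09:20:01", tz = "UTC")
finish <- as.POSIXct("2017-05-08 11:03:01", tz = "UTC")

# Elapsed time in hours, corresponding to the "Time:" field
sync_hours <- as.numeric(difftime(finish, start, units = "hours"))

# Throughput in M/s; the factor 1000 (G -> M) is an assumption
data_size_g <- 98.76
speed_m_per_s <- data_size_g * 1000 / (sync_hours * 3600)

print(round(sync_hours, 5))
print(round(speed_m_per_s, 2))
```

Under that assumption the result agrees with the logged 1.71667h and 15.98M/s.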
Round 1: Mon May 8 13:18:35 CST 2017
Round 2: Mon May 8 13:17:27 CST 2017
Finish: Mon May 8 14:26:23 CST 2017
Time: 1.13h
OG173110002N1LEUD2kx9b2
OG175310093N1LEUD2kx9b2
OG178510071N1LEUD2kx9b2
OG160200008N1LEUD2kx9b2 on capture_depth_1X with 608 outside 1000.0 to 4000
OG160200008N1LEUD2kx9b2 on seq_depth with 1485.83 outside 2000 to 6000
OG170250097N1LEUD2kx9b1 on capture_depth_1X with 638 outside 1000.0 to 4000
OG170250097N1LEUD2kx9b1 on %(G+C) with 51.06 outside 40 to 50
OG170250097N1LEUD2kx9b1 on seq_depth with 1435.96 outside 2000 to 6000
OG170250097T1HYTD2kx9b2 on capture_depth_1X with 756 outside 1000.0 to 4000
OG170250097T1HYTD2kx9b2 on seq_depth with 1824.8 outside 2000 to 6000
OG170250216N1LEUD2kx9b1 on capture_depth_1X with 617 outside 1000.0 to 4000
OG170250216N1LEUD2kx9b1 on %(G+C) with 50.74 outside 40 to 50
OG170250216N1LEUD2kx9b1 on seq_depth with 1131.57 outside 2000 to 6000
OG170250216T1FRED2kx9b2 on capture_depth_1X with 796 outside 1000.0 to 4000
OG170250216T1FRED2kx9b2 on seq_depth with 1868.89 outside 2000 to 6000
OG173710152N1LEUD2kx9b2 on capture_depth_1X with 986 outside 1000.0 to 4000
OG174310048N1LEUD2kx9b2 on capture_depth_1X with 739 outside 1000.0 to 4000
OG174310048N1LEUD2kx9b2 on seq_depth with 1763.07 outside 2000 to 6000
OG174310053N1LEUD2kx9b2 on capture_depth_1X with 829 outside 1000.0 to 4000
OG174310053N1LEUD2kx9b2 on seq_depth with 1931.95 outside 2000 to 6000
OG177910001N1LEUD2kx9b2 on capture_depth_1X with 746 outside 1000.0 to 4000
OG177910001N1LEUD2kx9b2 on seq_depth with 1754.58 outside 2000 to 6000
OG178510059N1LEUD2kx9b2 on capture_depth_1X with 912 outside 1000.0 to 4000
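The outlier lines above follow a simple rule: a metric is reported when its value falls outside a fixed range. A minimal R sketch of such a check is given below. The ranges are copied from the log; the function name `flag_outliers`, the list `qc_ranges`, and the choice to treat the bounds themselves as inside the range are assumptions, not the pipeline's actual code.

```r
# Allowed ranges as they appear in the log lines above
qc_ranges <- list(
  capture_depth_1X = c(1000, 4000),
  seq_depth        = c(2000, 6000),
  `%(G+C)`         = c(40, 50)
)

# Hypothetical helper: one message per metric whose value is outside its range
flag_outliers <- function(sample, values, ranges = qc_ranges) {
  msgs <- character(0)
  for (metric in names(values)) {
    rng <- ranges[[metric]]
    v <- values[[metric]]
    # Skip metrics without a configured range and missing values
    if (!is.null(rng) && !is.na(v) && (v < rng[1] || v > rng[2])) {
      msgs <- c(msgs, sprintf("%s on %s with %s outside %s to %s",
                              sample, metric, v, rng[1], rng[2]))
    }
  }
  msgs
}

msgs_example <- flag_outliers(
  "OG170250097N1LEUD2kx9b1",
  c(capture_depth_1X = 638, `%(G+C)` = 51.06, seq_depth = 1435.96)
)
print(msgs_example)
```

For this sample all three metrics are flagged, as in the log. The number formatting of the bounds may differ slightly from the logged text.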
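This report summarizes the sequencing results for tumor samples run on the existing CN500 sequencer, organized by the following perspectives and steps.
32 QC metrics: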
prj_sample, sample, size_Gb, GC, N, Q20, Q30, low_qual_filter, adapter_filter, undersize_ins_filter, duplicated_filter, clean_size_Gb, clean_GC, clean_N, clean_Q20, clean_Q30, coverage, mapping_rate, coverage_cent, specificity_cent, uniformity_cent, panel_dep, samtools_dups, insert_size, seq_dep, trim_adapter, mut_dep, sample_type, panel_type, date, eff_seq, eff_mut
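Key metrics:
seq_dep: raw data size / panel size, i.e. the expected depth
panel_dep: depth captured within the panel
eff_seq: panel_dep/seq_dep, i.e. data utilization of the experimental stage
mut_dep: mean depth at the mutation sites obtained during analysis
eff_mut: mut_dep/panel_dep, i.e. data utilization of the analysis stage
QC factors of concern at each stage:
The latest batch, 20170506_p1p4, contains 13 samples.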
##
## LEU OTHER
## 11 2
## prj_sample sample
## 1 CA-PM-20170506_p1p4/OG160200008N1LEUD2kx9b2 OG160200008N1LEUD2kx9b2
## 2 CA-PM-20170506_p1p4/OG170250097N1LEUD2kx9b1 OG170250097N1LEUD2kx9b1
## 3 CA-PM-20170506_p1p4/OG170250097T1HYTD2kx9b2 OG170250097T1HYTD2kx9b2
## 4 CA-PM-20170506_p1p4/OG170250216N1LEUD2kx9b1 OG170250216N1LEUD2kx9b1
## 5 CA-PM-20170506_p1p4/OG170250216T1FRED2kx9b2 OG170250216T1FRED2kx9b2
## 6 CA-PM-20170506_p1p4/OG173110002N1LEUD2kx9b2 OG173110002N1LEUD2kx9b2
## 7 CA-PM-20170506_p1p4/OG173710152N1LEUD2kx9b2 OG173710152N1LEUD2kx9b2
## 8 CA-PM-20170506_p1p4/OG174310048N1LEUD2kx9b2 OG174310048N1LEUD2kx9b2
## 9 CA-PM-20170506_p1p4/OG174310053N1LEUD2kx9b2 OG174310053N1LEUD2kx9b2
## 10 CA-PM-20170506_p1p4/OG175310093N1LEUD2kx9b2 OG175310093N1LEUD2kx9b2
## 11 CA-PM-20170506_p1p4/OG177910001N1LEUD2kx9b2 OG177910001N1LEUD2kx9b2
## 12 CA-PM-20170506_p1p4/OG178510059N1LEUD2kx9b2 OG178510059N1LEUD2kx9b2
## 13 CA-PM-20170506_p1p4/OG178510071N1LEUD2kx9b2 OG178510071N1LEUD2kx9b2
## size_Gb GC N Q20 Q30 low_qual_filter adapter_filter
## 1 0.0827 49.890 0.115 89.150 82.990 7.12940 0.456703
## 2 0.0799 51.060 0.100 95.200 92.045 3.31005 0.207781
## 3 0.1016 46.645 0.105 89.560 83.345 6.61774 0.283518
## 4 0.0630 50.740 0.140 95.955 92.870 3.00972 0.141257
## 5 0.1040 49.925 0.105 90.010 84.135 6.00480 0.389828
## 6 0.1332 48.775 0.110 90.175 84.285 6.13978 0.444264
## 7 0.1286 49.805 0.115 90.145 84.315 6.04230 0.367861
## 8 0.0981 48.170 0.105 90.090 84.120 6.19061 0.310002
## 9 0.1075 46.345 0.110 89.875 83.755 6.53646 0.282496
## 10 0.1406 47.750 0.110 89.815 83.730 6.65286 0.319571
## 11 0.0977 44.870 0.115 90.295 84.290 6.22008 0.251958
## 12 0.1211 49.225 0.105 90.420 84.690 5.80633 0.294168
## 13 0.1426 46.780 0.100 90.280 84.390 5.79150 0.299274
## undersize_ins_filter duplicated_filter clean_size_Gb clean_GC clean_N
## 1 0 4.53609 0.0726 50.065 False
## 2 0 10.93310 0.0684 51.065 False
## 3 0 4.18363 0.0903 46.805 False
## 4 0 7.48346 0.0562 50.800 False
## 5 48.2164 2.95642 0.0411 51.060 False
## 6 0 5.85512 0.1165 48.905 False
## 7 0 5.80904 0.1128 49.950 False
## 8 0 5.27317 0.0865 48.325 False
## 9 0 5.88394 0.0938 46.570 False
## 10 0 5.63580 0.1228 47.935 False
## 11 0 6.22530 0.0852 45.060 False
## 12 0 5.79539 0.1066 49.375 False
## 13 0 5.86790 0.1255 46.915 False
## clean_Q20 clean_Q30 coverage mapping_rate coverage_cent
## 1 91.055 85.140 0.0025 0.9979 0.9609
## 2 96.115 93.055 0.0029 0.9987 0.9267
## 3 91.265 85.275 0.0029 0.9982 0.9996
## 4 96.865 93.895 0.0010 0.9983 0.9359
## 5 91.485 86.075 0.0027 0.9981 0.9160
## 6 91.755 86.060 0.0026 0.9980 0.9723
## 7 91.720 86.080 0.0025 0.9981 0.9555
## 8 91.680 85.910 0.0027 0.9983 0.9817
## 9 91.525 85.620 0.0021 0.9982 0.9997
## 10 91.510 85.645 0.0038 0.9980 0.9887
## 11 91.860 86.055 0.0020 0.9982 0.9996
## 12 91.950 86.400 0.0025 0.9983 0.9697
## 13 91.755 86.040 0.0027 0.9983 0.9954
## specificity_cent uniformity_cent panel_dep samtools_dups insert_size
## 1 0.6059 0.9979 608 0.1448 261.0
## 2 0.6633 0.9963 638 0.1640 181.9
## 3 0.5857 0.9999 756 0.1341 187.3
## 4 0.6649 0.9967 617 0.1232 188.1
## 5 0.6146 0.9967 796 0.1526 172.9
## 6 0.5855 0.9985 1021 0.1642 180.5
## 7 0.5855 0.9977 986 0.1640 179.3
## 8 0.6049 0.9989 739 0.1497 177.2
## 9 0.5838 0.9999 829 0.1579 180.7
## 10 0.5969 0.9993 1062 0.1597 183.0
## 11 0.5741 0.9999 746 0.1600 183.9
## 12 0.5783 0.9979 912 0.1569 180.4
## 13 0.5850 0.9996 1105 0.1645 181.9
## seq_dep trim_adapter mut_dep sample_type panel_type date
## 1 1485.83 0.0622592 464 LEU p1p4 20170506_p1p4
## 2 1435.96 0.0614643 NA LEU p1p4 20170506_p1p4
## 3 1824.80 0.0521457 1332 OTHER p1p4 20170506_p1p4
## 4 1131.57 0.0522369 NA LEU p1p4 20170506_p1p4
## 5 1868.89 0.0700785 1347 OTHER p1p4 20170506_p1p4
## 6 2391.36 0.0593041 750 LEU p1p4 20170506_p1p4
## 7 2309.67 0.0602870 747 LEU p1p4 20170506_p1p4
## 8 1763.07 0.0651162 549 LEU p1p4 20170506_p1p4
## 9 1931.95 0.0603952 561 LEU p1p4 20170506_p1p4
## 10 2524.71 0.0540815 756 LEU p1p4 20170506_p1p4
## 11 1754.58 0.0558601 538 LEU p1p4 20170506_p1p4
## 12 2175.44 0.0603850 765 LEU p1p4 20170506_p1p4
## 13 2562.49 0.0586892 726 LEU p1p4 20170506_p1p4
## eff_seq eff_mut
## 1 0.4091989 0.7631579
## 2 0.4443021 NA
## 3 0.4142920 1.7619048
## 4 0.5452601 NA
## 5 0.4259213 1.6922111
## 6 0.4269537 0.7345739
## 7 0.4269008 0.7576065
## 8 0.4191552 0.7428958
## 9 0.4291001 0.6767189
## 10 0.4206424 0.7118644
## 11 0.4251730 0.7211796
## 12 0.4192255 0.8388158
## 13 0.4312212 0.6570136
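The eff_seq and eff_mut columns above follow directly from the ratio definitions of the key metrics (panel_dep/seq_dep and mut_dep/panel_dep). As a small worked example, the R sketch below recomputes them for the first three samples of this batch; the data frame name `qc` is illustrative.

```r
# seq_dep, panel_dep and mut_dep of samples 1-3 in batch 20170506_p1p4
qc <- data.frame(
  seq_dep   = c(1485.83, 1435.96, 1824.80),
  panel_dep = c(608, 638, 756),
  mut_dep   = c(464, NA, 1332)
)

# Experimental-stage utilization
qc$eff_seq <- qc$panel_dep / qc$seq_dep
# Analysis-stage utilization; stays NA where mut_dep is missing
qc$eff_mut <- qc$mut_dep / qc$panel_dep

print(qc)
```

Sample 3 yields an eff_mut above 1, meaning its mut_dep exceeds its panel_dep, which matches the value printed in the table.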
The five most recent batches: 20170506_p1p4, 20170504_p1p4, 20170504_p1p2p4, 20170502_p1p4, 20170502_p1p2p4
179 samples in total.
Counts by sample type, panel type, and date batch:
table(dd$sample_type)
##
## CF FFPE LEU OTHER
## 39 3 130 7
table(dd$panel_type)
##
## p1p2p4 p1p4
## 25 154
table(dd$date)
##
## 20170502_p1p2p4 20170502_p1p4 20170504_p1p2p4 20170504_p1p4
## 15 89 10 52
## 20170506_p1p4
## 13
thre_low <- 0.01
date_index <- first_index(dd$date)
ggplot(dd) + geom_point(aes(x = seq_len(nrow(dd)), y = log10(size_Gb * 1000), color = group_as_two(date))) + geom_hline(yintercept = log10(thre_low * 1000), color = 'red') + annotate("text", x = nrow(dd) / 2, y = log10(thre_low * 1000) - 0.1, label = paste("threshold:", thre_low, "Gb")) + labs(title = 'low data size') + annotate("text", x = date_index, y = rep(log10(thre_low * 1000), length(date_index)), label = dd$date[date_index], angle = 90, alpha = 0.5) + theme(legend.position = "none")
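The helpers `first_index` and `group_as_two` used in the plot above are not defined in this report. Judging only from how they are called (batch labels placed at `date_index`, two alternating colours), they might behave roughly like the sketch below. Both functions, including the `_sketch` names and the "a"/"b" labels, are hypothetical stand-ins and not the author's implementation.

```r
# Hypothetical: row index of the first occurrence of each distinct value
first_index_sketch <- function(x) {
  which(!duplicated(x))
}

# Hypothetical: alternate two labels across consecutive distinct values,
# so that neighbouring batches receive different colours
group_as_two_sketch <- function(x) {
  batch_no <- match(x, unique(x))
  factor(ifelse(batch_no %% 2 == 1, "a", "b"))
}

dates <- c("20170502_p1p4", "20170502_p1p4", "20170504_p1p4", "20170506_p1p4")
print(first_index_sketch(dates))
print(group_as_two_sketch(dates))
```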
dd[which(dd$size_Gb < thre_low), c('prj_sample', 'size_Gb')]
## [1] prj_sample size_Gb
## <0 rows> (or 0-length row.names)
# Keep every sample that was not listed as low-data above (size_Gb < thre_low)
dd <- dd[which(dd$size_Gb >= thre_low), ]
After removing low-data samples (raw data size below 0.01 Gb), 179 samples remain.
Counts by category after removing low-data samples:
table(dd$sample_type)
##
## CF FFPE LEU OTHER
## 39 3 130 7
table(dd$panel_type)
##
## p1p2p4 p1p4
## 25 154
table(dd$date)
##
## 20170502_p1p2p4 20170502_p1p4 20170504_p1p2p4 20170504_p1p4
## 15 89 10 52
## 20170506_p1p4
## 13
dd$expect_size <- c(unlist(sapply(dd$sample_type, expect_size)))
ggplot(dd) + geom_boxplot(aes(date, size_Gb), alpha = 0.3) + geom_violin(aes(date, size_Gb), alpha = 0.6) + geom_jitter(aes(date, size_Gb)) +geom_hline(aes(yintercept=0.3*expect_size),color="red",alpha=0.6) + geom_hline(aes(yintercept=2*expect_size),color="red",alpha=0.6) + labs(title = 'data size by sample_type') + theme(legend.position = "none", axis.text.x=element_text(angle=20)) + facet_wrap(~sample_type, scales = 'free')
## Warning in max(data$density): no non-missing arguments to max; returning -
## Inf
## Warning in max(data$density): no non-missing arguments to max; returning -
## Inf
#ggplot(dd) + geom_boxplot(aes(sample_type, size_Gb), alpha = 0.3) + geom_violin(aes(sample_type, size_Gb), alpha = 0.6) + geom_jitter(aes(sample_type, size_Gb)) + labs(title = 'data size by date') + theme(legend.position = "none") + facet_wrap(~date, scales = 'free')
Batch-to-batch stability of raw data size
thre_eff_seq <- 0.33
ggplot(dd) + geom_boxplot(aes(sample_type, eff_seq), alpha = 0.3) + geom_violin(aes(sample_type, eff_seq), alpha = 0.6) + geom_jitter(aes(sample_type, eff_seq)) + geom_hline(yintercept = thre_eff_seq, color = 'red') + labs(title = 'panel_dep/seq_dep by sample_type')
ggplot(dd) + geom_boxplot(aes(sample_type, eff_seq), alpha = 0.3) + geom_violin(aes(sample_type, eff_seq), alpha = 0.6) + geom_jitter(aes(sample_type, eff_seq)) + geom_hline(yintercept = thre_eff_seq, color = 'red') + labs(title = 'panel_dep/seq_dep by sample_type by date') + facet_wrap(~date, scales = 'free_y')
Stability of raw-data utilization (eff_seq): mean: 0.3840941, SD: 0.1576461; for cfDNA samples, mean: 0.3924713, SD: 0.1247553
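A stability summary of this kind can be computed with `mean()` and `sd()`. The R sketch below does so for the 13 samples of batch 20170506_p1p4 only, so its numbers are not expected to match the 179-sample figures quoted above. The coefficient of variation is an addition for illustration and is not reported in the original.

```r
# eff_seq of the 13 samples in batch 20170506_p1p4 (from the batch table)
eff_seq <- c(0.4091989, 0.4443021, 0.4142920, 0.5452601, 0.4259213,
             0.4269537, 0.4269008, 0.4191552, 0.4291001, 0.4206424,
             0.4251730, 0.4192255, 0.4312212)
thre_eff_seq <- 0.33

# Batch-level stability summary
eff_mean <- mean(eff_seq)
eff_sd   <- sd(eff_seq)
# Coefficient of variation, to compare batches with different means
eff_cv   <- eff_sd / eff_mean

# Samples falling below the utilization threshold
low_idx <- which(eff_seq < thre_eff_seq)

print(c(mean = eff_mean, sd = eff_sd, cv = eff_cv))
print(low_idx)
```

No sample of this batch falls below the 0.33 threshold.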
ggplot(dd) + geom_point(aes(seq_dep, panel_dep)) + geom_smooth(aes(seq_dep, panel_dep)) + facet_wrap(~sample_type, scales = 'free') + labs(title = 'seq eff')
## `geom_smooth()` using method = 'loess'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : span too small. fewer data values than degrees of freedom.
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : pseudoinverse used at 864.46
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : neighborhood radius 3.3025
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric =
## parametric, : There are other near singularities as well. 3.9167e+05
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : span too small.
## fewer data values than degrees of freedom.
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used
## at 864.46
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius
## 3.3025
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal
## condition number 0
## Warning in predLoess(object$y, object$x, newx = if
## (is.null(newdata)) object$x else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other
## near singularities as well. 3.9167e+05
ggplot(dd) + geom_point(aes(seq_dep, eff_seq, color = date)) + geom_smooth(aes(seq_dep, eff_seq)) + labs(title = 'seq_eff along with seq_dep') + facet_wrap(~sample_type, scales = c('free'))
## `geom_smooth()` using method = 'loess'
## Warning: the same ten simpleLoess/predLoess warnings as for the previous plot (span too small, fewer data values than degrees of freedom; pseudoinverse used at 864.46; neighborhood radius 3.3025; reciprocal condition number 0; other near singularities 3.9167e+05)
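Greater sequencing depth yields more usable data, but utilization shows no clear linear correlation with depth because it fluctuates widely (and the saturation depth limit is 10000X).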
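The relationship between depth and utilization can also be quantified with a correlation coefficient rather than judged from the plots alone. The R sketch below uses only the 13 samples of batch 20170506_p1p4, so it illustrates the calculation and does not reproduce the 179-sample conclusion. The Spearman variant is an added suggestion, not part of the original analysis.

```r
# seq_dep and panel_dep of the 13 samples in batch 20170506_p1p4
seq_dep   <- c(1485.83, 1435.96, 1824.80, 1131.57, 1868.89, 2391.36, 2309.67,
               1763.07, 1931.95, 2524.71, 1754.58, 2175.44, 2562.49)
panel_dep <- c(608, 638, 756, 617, 796, 1021, 986, 739, 829, 1062, 746, 912, 1105)
eff_seq   <- panel_dep / seq_dep

# Pearson correlation: depth vs. usable depth, and depth vs. utilization
cor_depth <- cor(seq_dep, panel_dep)
cor_eff   <- cor(seq_dep, eff_seq)

# Rank-based alternative that is less sensitive to single outlying samples
cor_eff_spearman <- cor(seq_dep, eff_seq, method = "spearman")

print(c(cor_depth = cor_depth, cor_eff = cor_eff,
        cor_eff_spearman = cor_eff_spearman))
```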
ggplot(dd) + geom_point(aes(GC, eff_seq, color = date, size = panel_type, shape = sample_type), alpha = 0.6) + stat_smooth(aes(GC, eff_seq)) + labs(title = 'eff_seq ~ GC')
## `geom_smooth()` using method = 'loess'
ggplot(dd) + geom_point(aes(GC, eff_seq, color = sample_type, size = panel_type), alpha = 0.6) + stat_smooth(aes(GC, eff_seq)) + labs(title = 'eff_seq ~ GC by sample') + facet_wrap(~sample_type, scales = "free_x")
## `geom_smooth()` using method = 'loess'
ggplot(dd) + geom_point(aes(GC, eff_seq, color = sample_type, size = panel_type), alpha = 0.6) + stat_smooth(aes(GC, eff_seq)) + labs(title = 'eff_seq ~ GC by sample by date') + facet_wrap(~date, ncol = 3, scales = "free")
## `geom_smooth()` using method = 'loess'
GC content affects data utilization; a reasonable range is the expected GC% ± 2% (as in NIPT).
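The "expected GC% ± 2%" rule can be written as a one-line check. In the R sketch below, the function name `gc_outside` is illustrative and `expected_gc = 48` is purely a placeholder: the report does not state the expected GC% for these panels.

```r
# Flag samples whose GC% lies outside expected +/- tol
gc_outside <- function(gc, expected, tol = 2) {
  abs(gc - expected) > tol
}

# GC% of the 13 samples in batch 20170506_p1p4 (from the batch table)
gc <- c(49.890, 51.060, 46.645, 50.740, 49.925, 48.775, 49.805,
        48.170, 46.345, 47.750, 44.870, 49.225, 46.780)

# Placeholder value, NOT taken from this report
expected_gc <- 48
flagged <- which(gc_outside(gc, expected_gc))
print(flagged)
```

The flagged set depends entirely on the chosen expected value, so the placeholder should be replaced with the panel's actual expected GC% before use.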
thre_q30 <- 0.8
ggplot(dd) + geom_boxplot(aes(sample_type, Q30), alpha = 0.3) + geom_violin(aes(sample_type, Q30), alpha = 0.6) + geom_jitter(aes(sample_type, Q30), size = 0.2) + geom_hline(yintercept = thre_q30 * 100, color = 'red') + labs(title = 'Q30 by date') + facet_wrap(~date, scales = 'free')
Q30 threshold: 0.8 (drawn at 80 because the Q30 column is in percent)
ggplot(dd_sub) + geom_boxplot(aes(sample_type, specificity_cent), alpha = 0.3) + geom_violin(aes(sample_type, specificity_cent), alpha = 0.6) + geom_jitter(aes(sample_type, specificity_cent)) + labs(title = 'specificity_cent by sample_type by date') + facet_wrap(~date, scales = 'free_y') + geom_hline(yintercept = 0.6, color = "red")
Mostly normal.
thre_adapter <- 0.2
ggplot(dd) + geom_boxplot(aes(sample_type, trim_adapter), alpha = 0.3) + geom_violin(aes(sample_type, trim_adapter), alpha = 0.6) + geom_jitter(aes(sample_type, trim_adapter)) + labs(title = 'trim_adapter by sample_type by date') + facet_wrap(~date, scales = 'free_y') + geom_hline(yintercept = thre_adapter, color = "red")
Normal.
thre_eff_mut <- 0.2
ggplot(dd) + geom_boxplot(aes(sample_type, eff_mut), alpha = 0.3) + geom_violin(aes(sample_type, eff_mut), alpha = 0.6) + geom_jitter(aes(sample_type, eff_mut), size = 0.2) + geom_hline(yintercept = thre_eff_mut, color = 'red') + labs(title = 'mut_dep/panel_dep by sample_type')
ggplot(dd) + geom_boxplot(aes(sample_type, eff_mut), alpha = 0.3) + geom_violin(aes(sample_type, eff_mut), alpha = 0.6) + geom_jitter(aes(sample_type, eff_mut), size = 0.2) + geom_hline(yintercept = thre_eff_mut, color = 'red') + labs(title = 'mut_dep/panel_dep by sample_type by date') + facet_wrap(~date, ncol = 3, scales = 'free_y')
Stability of analysis-stage data utilization (eff_mut): mean: 0.8516142, SD: 0.3797517; for cfDNA samples, mean: 0.6043703, SD: 0.4661001
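Because some samples have no mut_dep (and therefore an NA eff_mut), per-type summaries like the one above need explicit NA handling. The R sketch below shows one way to do this with `tapply()` on the 13 samples of batch 20170506_p1p4. It covers only the LEU and OTHER types present in that batch, so it does not reproduce the figures quoted above.

```r
# sample_type and eff_mut of the 13 samples in batch 20170506_p1p4
sample_type <- c("LEU", "LEU", "OTHER", "LEU", "OTHER", "LEU", "LEU",
                 "LEU", "LEU", "LEU", "LEU", "LEU", "LEU")
eff_mut <- c(0.7631579, NA, 1.7619048, NA, 1.6922111, 0.7345739, 0.7576065,
             0.7428958, 0.6767189, 0.7118644, 0.7211796, 0.8388158, 0.6570136)

# Per-type mean and SD; na.rm = TRUE skips samples without a mut_dep value
mut_mean <- tapply(eff_mut, sample_type, mean, na.rm = TRUE)
mut_sd   <- tapply(eff_mut, sample_type, sd, na.rm = TRUE)
# Number of samples that actually contribute to each summary
n_used   <- tapply(!is.na(eff_mut), sample_type, sum)

print(cbind(n = n_used, mean = mut_mean, sd = mut_sd))
```

Reporting the contributing sample count next to each mean makes it visible when a summary rests on very few values, as for the two OTHER samples here.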
thre_dups <- 0.6
ggplot(dd) + geom_point(aes(samtools_dups, eff_mut, color = sample_type, size = eff_seq)) + geom_smooth(aes(samtools_dups, eff_mut)) + labs(title = 'eff_mut along with samtools_dups') + facet_wrap(~sample_type, scales = 'free')
## `geom_smooth()` using method = 'loess'
ggplot(dd) + geom_boxplot(aes(sample_type, samtools_dups), alpha = 0.3) + geom_violin(aes(sample_type, samtools_dups), alpha = 0.6) + geom_jitter(aes(sample_type, samtools_dups), size = 0.2) + geom_hline(yintercept = thre_dups, color = 'red') + labs(title = 'samtools_dups by date') + facet_wrap(~date, scales = c('free'))
Duplicates (samtools_dups) affect the analysis-stage utilization.